Measurement of performance of communications systems

ABSTRACT

The subjective quality of an audio-visual stimulus is assessed by measuring the actual synchronisation errors between the audio and visual elements of the stimulus, identifying characteristics of audio and visual cues in the stimulus, and generating a measure of subjective quality from said errors and characteristics. The nature of the cue has an effect on the perceptual significance of a given value of synchronisation error, and this can be used to relax tolerances to such errors when appropriate, or to provide an accurate measure of the quality of the signal as it would be perceived by a human subject.

FIELD OF THE INVENTION

This invention relates to signal processing. It is of application to the testing of communications systems and installations, and to other uses as will be described. The term “communications system” covers telephone or television networks and equipment, public address systems, computer interfaces, and the like.

BACKGROUND OF THE INVENTION

It is desirable to use objective, repeatable performance metrics to assess the acceptability of performance at the design, commissioning, and monitoring stages of communications services provision. However, a key aspect of system performance is the measurement of subjective quality, which is central in determining customer satisfaction with products and services. The complexity of modern communications and broadcast systems, and in particular the use of data reduction techniques, renders conventional engineering metrics inadequate for the reliable prediction of perceived performance. Subjective testing using human observers is expensive, time-consuming and often impractical, particularly for field use. Objective assessment of the perceived (subjective) performance of complex systems has been enabled by the development of a new generation of measurement techniques which emulate the properties of the human senses. For example, a poor value of an objective measure such as signal-to-noise ratio may be caused by a distortion which is in fact inaudible. A model of the masking that occurs in hearing is capable of distinguishing between audible and inaudible distortions.

The use of models of the human senses to provide improved understanding of subjective performance is known as perceptual modelling. The present applicants have a series of previous patent applications referring to perceptual models and test signals suitable for non-linear speech systems, including WO 94/00922, WO 95/01011 and WO 95/15035.

To determine the subjective relevance of errors in audio systems, and particularly speech systems, assessment algorithms have been developed based on models of human hearing. The prediction of audible differences between a degraded signal and a reference signal can be thought of as the sensory layer of a perceptual analysis, while the subsequent categorisation of audible errors according to their subjective effect on overall signal quality can be thought of as the perceptual layer.

An approach similar to this auditory perceptual model has also been adopted for a visual perceptual model. In this case the sensory layer reproduces the gross psychophysics of the sensory mechanisms, in particular spatio-temporal sensitivity (known as the human visual filter), and masking due to spatial frequency, orientation and temporal frequency.

A number of visual perceptual models are under development and several have been proposed in the literature.

The subjective performance of multi-modal systems depends not only on the quality of the individual audio and video components, but also on interactions between them. Such effects include “quality mis-match”, in which the quality presented in one modality influences perception in another modality; the effect grows as the mis-match between the modalities increases.

The information content of the signal is also important. This is related to the task undertaken but can vary during the task. For present purposes, “content” refers to the nature of the audio-visual material during any given part of the task.

The type of task or activity undertaken also has a substantial effect on perceived performance. As a simple example, if the video component dominates for a given task then errors in the video part will be of greatest significance. At the same time, audio errors which have high attentional salience (are “attention-grabbing”) will also become important. The nature of the task undertaken influences the split of attention between the modalities, although this split may also vary more randomly if the task is undemanding.

However, important though these factors are, they are in general difficult to define and to use for making objective measurements. Nevertheless, the inventor has identified some cross-modal effects which can be derived from objective measurements.

SUMMARY OF THE INVENTION

According to the invention there is provided a method of determining the subjective quality of an audio-visual stimulus, comprising the steps of:

measuring the actual synchronisation errors between the audio and visual elements of the stimulus,

identifying characteristics of audio and visual cues in the stimulus, and generating a measure of subjective quality from said errors and characteristics.

According to another aspect there is provided apparatus for determining the subjective quality of an audio-visual stimulus, comprising means for measuring the actual synchronisation errors between the audio and visual elements of the stimulus, means for the identification of characteristics of audio and visual cues in the stimulus, and means for generating a measure of subjective quality from said synchronisation errors and characteristics.

It has been observed experimentally that human subjects have different sensitivities to a given synchronisation error, depending on the type of cue with which it is associated. Thus, poorly-synchronised stimuli containing certain cue types will be perceived as of lower quality than equally poorly-synchronised stimuli containing other cue types. Synchronisation tolerances have been an essential consideration in television broadcasting for many years. However, for emerging telepresence technologies, synchronisation must be dynamically controlled. Audio/video synchronisation error detection is dependent on the task undertaken, the nature of the stimulus (content), and whether the error results in the audio leading or lagging the video [ITU-T Recommendation J.100, “Tolerances for transmission time differences between vision and sound components of a television signal”, 1990].

Results to be presented later in this specification illustrate that the synchronisation tolerances can be relaxed for certain types of content, and that the subjectivity of synchronisation error remains relatively low over a much greater range of values for these types.

Although, in general, information content is not measurable by an objective test, certain cue types have been identified on which human sensitivity to synchronisation error depends, and which are distinguishable by such tests.

Synchronisation errors are also relatively easy to measure, so the invention allows a network operator to determine by objective measures whether the current error is perceptually significant, having regard to the nature of the cue.

The characteristics of the audio and visual cues are preferably used to generate one or more synchronisation error tolerance values, which may correspond to different degrees of perceptual error, for example as measured by human subjects. The audio-visual stimulus can be monitored for occurrences of synchronisation errors exceeding such tolerance values, to provide a quantitative output. The means generating the stimulus may be controlled dynamically to maintain the synchronisation in a predetermined relationship with the said tolerance values, for example by buffering the earlier-arriving stimulus or by omitting elements from the later-arriving one to bring them into synchronism. Maintenance of synchronisation can make considerable demands on an audio-visual system. Buffering requires memory capacity. Alternatively, if channels are congested, data packets of one or other channel (sound or vision) may have to be sacrificed to maintain synchronisation to a given level, reducing the signal quality of that channel. Therefore, if a more relaxed tolerance level can be applied at certain times, greater synchronisation errors can be allowed, thereby reducing the required channel capacity and/or the amount of lost data.
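
By way of illustration only, the following sketch shows one way the dynamic control described above might be realised. The CueTolerance structure and function names are assumptions, not part of the specification; the 20 ms/40 ms figures echo the ITU-T J.100 thresholds discussed later.

```python
# Illustrative sketch of the dynamic control loop described above.
# All names (CueTolerance, control_sync) are hypothetical; the
# specification does not prescribe an implementation.

from dataclasses import dataclass

@dataclass
class CueTolerance:
    """Cue-dependent synchronisation tolerances, in milliseconds."""
    audio_lead_ms: float   # tolerable error when audio arrives early
    audio_lag_ms: float    # tolerable error when audio arrives late

def control_sync(skew_ms: float, tol: CueTolerance) -> str:
    """Decide a corrective action for the measured relative delay.

    skew_ms > 0 means the audio leads the video; skew_ms < 0 means it
    lags. Within the cue-dependent tolerance no action is taken, so a
    relaxed tolerance reduces buffering demands and data loss.
    """
    if skew_ms > tol.audio_lead_ms:
        return "buffer audio (or drop video data) to delay the audio"
    if skew_ms < -tol.audio_lag_ms:
        return "buffer video (or drop audio data) to delay the video"
    return "within tolerance: no correction needed"

# Example: a relaxed tolerance for a low-sensitivity cue type lets a
# 30 ms audio lead pass uncorrected, where a strict one would not.
print(control_sync(30.0, CueTolerance(audio_lead_ms=20.0, audio_lag_ms=40.0)))
print(control_sync(30.0, CueTolerance(audio_lead_ms=80.0, audio_lag_ms=160.0)))
```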

Where there are several channels in use, each carrying different stimulus types, they may be controlled such that they all have the same perceptual quality value, although the synchronisation errors themselves may be different.

A further application of the invention is in the real-time generation of audio-visual events in virtual environments, in particular the real-time optimisation of synthetic people such as animated talking faces. The process may be used in particular for the matching of synthetic head viseme (mouth shape) transitions with the acoustic waveform data generating the speech to be represented, thereby generating more realistic avatars.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the invention will now be described, by way of example only, with reference to the Figures.

FIG. 1 shows in schematic form the principal components of a multi-sensory perceptual measurement system.

FIG. 2 shows a synchronisation perception measurement component of the system of FIG. 1.

FIGS. 3 and 4 illustrate experimental data indicative of the behaviour modelled by the synchronisation measurement component.

FIG. 5 illustrates schematically the use of the system of FIG. 2 in generating visemes for an avatar.

DETAILED DESCRIPTION OF THE INVENTION

A suitable architecture for a multi-sensory model is shown in FIG. 1. The main components are:

auditory and visual sensory models 10, 20;

a cross-modal model 30, which includes a synchronisation perception model shown in detail in FIG. 2; and

a scenario-specific perceptual layer 40.

An auditory sensory layer model component 10 comprises an input 11 for the audio stimulus, which is provided to an auditory sensory layer model 12. The auditory model 12 measures the perceptual importance of the various auditory bands and time elements of the stimulus, and generates an output 16 representative of the audible error as a function of auditory band (pitch) and time. This audible error may be derived by comparison of the perceptually modified audio stimulus 13 and a reference signal 14, the difference being determined by a subtraction unit 15 to provide an output 16 in the form of a matrix of subjective error as a function of auditory band and time, defined by a series of coefficients E_(da1), E_(da2), . . . , E_(dan). Alternatively the model may produce the output 16 without the use of a reference signal, for example according to the method described in International patent specification number WO 96/06496.

A similar process takes place with respect to the visual sensory layer model 20. An input 21 for the visual stimulus is provided to a visual sensory layer model 22, which generates an output 26 representative of the visible error. This error may be derived by comparison of the perceptually modified visual stimulus 23 and a reference signal 24, the difference being determined by a subtraction unit 25 to provide the output 26 in the form of a matrix of subjective error. However, in this context a further step is required. The image generated by the visual sensory layer model 22 is also analysed in an image decomposition unit 27 to identify elements in which errors are particularly significant, which are weighted accordingly, as described in International patent specification number WO 97/32428. This provides a weighting function for those elements of the image which are perceptually the most important; in particular, errors at boundaries are perceptually more important than errors within the body of an image element. The weighting functions generated in the weighting generator 28 are then applied to the output 26 in a visible error calculation unit 29 to produce a “visible error matrix”, analogous to the audible error matrix described above and defined by a series of coefficients E_(dv1), E_(dv2), . . . , E_(dvn). Images are themselves two-dimensional, so for a moving image the visible error matrix will have at least three dimensions.
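
As an informal illustration of the weighting step performed by units 25 to 29, the sketch below multiplies a raw frame-error matrix by an edge-emphasis map. The gradient-based boundary map is a stand-in assumption; the actual image decomposition is that of WO 97/32428.

```python
# Hypothetical sketch of the visible-error weighting step (units 25-29).

import numpy as np

def visible_error_matrix(degraded: np.ndarray,
                         reference: np.ndarray) -> np.ndarray:
    """Return a perceptually weighted error matrix for one frame.

    Both inputs are 2-D luminance arrays after the sensory-layer model.
    """
    raw_error = degraded - reference            # subtraction unit 25

    # Crude boundary map: gradient magnitude of the reference frame,
    # standing in for the image decomposition of unit 27.
    gy, gx = np.gradient(reference.astype(float))
    boundary = np.hypot(gx, gy)
    weights = 1.0 + boundary / (boundary.max() + 1e-9)  # unit 28

    return weights * raw_error                  # unit 29

# A moving image stacks such 2-D matrices over time, giving the
# three-dimensional visible error matrix E_dv described above.
```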

It should also be noted that the individual coefficients in the audible and visible error matrices may be vector properties.

There are a number of cross-modal effects which can affect the perceived quality of the signal. The effects to be modelled by the cross-modal model 30 may include the quality balance between modalities (vision and audio) and timing effects between the modalities. Such timing effects include sequencing (event sequences in one modality affecting user sensitivity to events in another) and synchronisation (correlation between events in different modalities).

One key component of the multi-modal model of the present invention is synchronisation. This part of the model is shown in FIG. 2. The degree of synchronisation between the inputs is determined in a synchronisation measurement unit 38. This takes inputs from the visual sensory layer (input 38 v) and the audible sensory layer (input 38 a) relating to the respective delays in the two signals. The synchronisation measurement unit 38 determines the difference between these two delays and generates an output 38 s representative of the relative delay between the two signals. It is this relative delay, rather than the absolute delay in either signal, that is perceptually significant. Such lack of synchronisation has been determined in prior art systems but, as will be discussed, the perceptual importance of such synchronisation errors varies according to the nature of the stimulus.

To this end, the cross-modal model 30 also uses information about the audio and video data streams (inputs 35, 36), and optionally the task being undertaken (input 37), to determine the subjectivity of any synchronisation errors.

In this embodiment the objective parameters describing the audio components of the signals are audio descriptors generated from the input 35 in a processor 31. These audio descriptors are the RMS energy over a succession of overlapping short intervals of predetermined length, together with signal peak and decay parameters. These values give an indication of the general shape and duration of individual audio events.
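
The descriptors named above can be illustrated as follows. The window length, overlap, and the 10% decay criterion are assumptions made for the sketch; the specification fixes no particular values.

```python
# Illustrative computation of the audio descriptors: RMS energy over
# overlapping windows, plus simple peak and decay parameters.

import numpy as np

def audio_descriptors(signal: np.ndarray, sample_rate: int,
                      window_ms: float = 20.0, overlap: float = 0.5):
    """Return per-window RMS energies and peak/decay parameters."""
    win = int(sample_rate * window_ms / 1000.0)
    hop = max(1, int(win * (1.0 - overlap)))

    rms = np.array([
        np.sqrt(np.mean(signal[i:i + win] ** 2))
        for i in range(0, len(signal) - win + 1, hop)
    ])

    peak_idx = int(np.argmax(rms))
    peak = float(rms[peak_idx])
    # Decay: windows taken for the envelope to fall to 10% of its
    # peak, a rough proxy for the shape/duration of the audio event.
    below = np.nonzero(rms[peak_idx:] < 0.1 * peak)[0]
    decay_windows = int(below[0]) if below.size else len(rms) - peak_idx

    return rms, peak, decay_windows
```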

The parameters describing the video components are video descriptors generated from the input 36 in a processor 32, such as motion vectors (see for example chapter 5 in [Netravali A N, Haskell B G, “Digital Pictures: representation and compression”, Plenum Press, ISBN 0-306-42791-5, June 1991]), and a persistence parameter describing the subjective importance, and the decay of this importance with time.

These parameters are used by a further processor 33 to determine the nature of the content of the stimulus, and to generate therefrom a synchronisation error perceptibility value, which is output (39) to the perceptual model 40 along with the actual value of the synchronisation error (output 38 s). The perceptual model 40 can then compare the synchronisation error value with the perceptibility value to generate a perceptual quality value, which contributes to a cross-modal combining function fn_(pm) to be used by the perceptual model 40.
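
A minimal sketch of the comparison performed by the perceptual model 40 follows: the measured synchronisation error is judged against the cue-dependent perceptibility value from processor 33. The linear fall-off used to map the comparison to a quality score is an assumption for illustration only.

```python
def perceptual_quality(sync_error_ms: float,
                       perceptibility_ms: float) -> float:
    """Map a synchronisation error to a 0..1 quality contribution.

    Errors below the cue-dependent perceptibility value are treated
    as imperceptible (quality 1.0); beyond it, quality falls off
    linearly (an illustrative choice, not specified by the text).
    """
    excess = abs(sync_error_ms) - perceptibility_ms
    if excess <= 0:
        return 1.0
    return max(0.0, 1.0 - excess / (4.0 * perceptibility_ms))

# The same 60 ms error is insignificant for a tolerant cue type but
# clearly perceptible for a sensitive one.
print(perceptual_quality(60.0, perceptibility_ms=80.0))   # 1.0
print(perceptual_quality(60.0, perceptibility_ms=20.0))   # 0.5
```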

A mathematical structure for the model can be summarised as follows. Suppose that:

E_(da1), E_(da2), . . . , E_(dan) are the audio error descriptors, and

E_(dv1), E_(dv2), . . . , E_(dvn) are the video error descriptors.

Then, for a given task:

fn_(aws) is the weighting function used to calculate the audio error subjectivity,

fn_(vws) is the weighting function used to calculate the video error subjectivity, and

fn_(pm) is the cross-modal combining function previously discussed. This function may include other weightings, to account for other cross-modal factors, for example quality mismatches and task-related factors.

The task-specific perceived performance metric, PM, output from the model 40 is then:

PM = fn_(pm) [fn_(aws) {E_(da1), E_(da2), . . . , E_(dan)}, fn_(vws) {E_(dv1), E_(dv2), . . . , E_(dvn)}]
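
The combining structure of PM can be sketched as below. The particular weighting functions (simple weighted means) are assumptions; the specification leaves fn_(pm), fn_(aws) and fn_(vws) task-dependent and unspecified.

```python
from typing import Sequence

def fn_aws(e_da: Sequence[float]) -> float:
    """Audio error subjectivity: here, simply the mean audible error."""
    return sum(e_da) / len(e_da)

def fn_vws(e_dv: Sequence[float]) -> float:
    """Video error subjectivity: here, simply the mean visible error."""
    return sum(e_dv) / len(e_dv)

def fn_pm(audio_subj: float, video_subj: float,
          audio_weight: float = 0.5) -> float:
    """Cross-modal combination into the performance metric PM."""
    return audio_weight * audio_subj + (1.0 - audio_weight) * video_subj

e_da = [0.1, 0.3, 0.2]          # audio error descriptors E_da1..E_dan
e_dv = [0.4, 0.2, 0.3, 0.5]     # video error descriptors E_dv1..E_dvn
pm = fn_pm(fn_aws(e_da), fn_vws(e_dv))
print(pm)  # 0.275
```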

The perceptual layer model 40 may be configured for a specific task, or may be configurable by additional variable inputs T_(wa), T_(wv) to the model (inputs 41, 42), indicative of the nature of the task to be carried out, which vary the weightings in the function fn_(pm) according to the task. For example, in a video-conferencing facility, the quality of the audio signal is generally more important than that of the visual signal. However, if the video conference switches from a view of the individuals taking part in the conference to a document to be studied, the visual significance of the image becomes more important, affecting what weighting is appropriate between the visual and auditory elements. These values T_(wa), T_(wv) may also be fed back to the synchronisation perception measuring function 38, to allow the synchronisation error subjectivity to vary according to the task involved. High-level cognitive preconceptions associated with the task, the attention split between modalities, the degree of stress introduced by the task, and the level of experience of the user all have an effect on the subjective perception of quality.

The functions fn_(aws), fn_(vws) may themselves be made functions of the task weightings, allowing the relative importance of individual coefficients E_(da1), E_(dv1) etc. to be varied according to the task involved, giving a prediction of the performance metric PM′ as:

PM′ = fn′_(pm) [fn′_(aws) {E_(da1), E_(da2), . . . , E_(dan), T_(wa)}, fn′_(vws) {E_(dv1), E_(dv2), . . . , E_(dvn), T_(wv)}]
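
The task-weighted variant PM′ can be sketched by letting T_(wa), T_(wv) rescale each modality's contribution before combination. This scheme, like the descriptor values, is an illustrative assumption.

```python
from typing import Sequence

def fn_aws_task(e_da: Sequence[float], t_wa: float) -> float:
    """Audio error subjectivity modulated by the task weighting T_wa."""
    return t_wa * sum(e_da) / len(e_da)

def fn_vws_task(e_dv: Sequence[float], t_wv: float) -> float:
    """Video error subjectivity modulated by the task weighting T_wv."""
    return t_wv * sum(e_dv) / len(e_dv)

# Video-conference example from the text: audio dominates until the
# view switches to a document, when the video weighting rises.
e_da, e_dv = [0.1, 0.3, 0.2], [0.4, 0.2, 0.3, 0.5]
pm_talking  = fn_aws_task(e_da, 0.7) + fn_vws_task(e_dv, 0.3)
pm_document = fn_aws_task(e_da, 0.3) + fn_vws_task(e_dv, 0.7)
print(pm_talking, pm_document)
```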

A multi-dimensional description of the error subjectivity in the auditory and visual modalities is thereby produced.

In the arrangement of FIG. 5 an avatar 29 represented on a screen is generated from an audio input 10, from which the speech content is derived, and an input 20 which provides the basic gesture data for the animation of the avatar in the animation unit 28. The audio input 10 is also supplied to a speaker system 19, and to the animation process. The process requires the selection of the viseme (facial arrangement) appropriate to the sound being uttered (element 27), and this is used to control the animation unit 28. It is desirable to have the visemes synchronised with the sound, but the animation process makes this difficult to achieve. Some visemes are more tolerant of synchronisation errors than others, and so by applying the audio input 10 and the identity of the selected viseme to the synchronisation model 30 (FIG. 2) this tolerance can be determined, and used to control the animation process 28, for example by extending or shortening the duration of the more tolerant visemes, to allow better synchronisation of the less-tolerant visemes.
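
A hypothetical sketch of this viseme-scheduling idea follows: drift is allowed to accumulate across tolerance-rich visemes, and the schedule snaps back to the audio timeline whenever a viseme's cue-dependent tolerance would be exceeded. The Viseme structure and all values are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Viseme:
    name: str
    duration_ms: float
    target_audio_ms: float   # when its sound occurs in the audio track
    tolerance_ms: float      # cue-dependent sync tolerance (model 30)

def schedule(visemes: list[Viseme]) -> list[tuple[str, float]]:
    """Return (name, start_ms) pairs for the animation unit.

    Tolerant visemes simply absorb the running drift between animation
    time and the audio timeline; when the drift would exceed a
    viseme's tolerance, that viseme is snapped to its audio target.
    """
    out, t = [], 0.0
    for v in visemes:
        drift = t - v.target_audio_ms
        if abs(drift) > v.tolerance_ms:
            t = v.target_audio_ms  # tighten: realign with the audio
        out.append((v.name, t))
        t += v.duration_ms
    return out
```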

In these embodiments the values derived in the processor 33 depend on the stimulus type. A selection of experimental results showing this inter-relationship is presented below in order to illustrate the influence of the stimulus on the perceptual relevance of synchronisation errors.

FIG. 3 shows the number of subjects detecting a synchronisation error, averaged across three stimulus types. These types are:

(1) an object entering and leaving the field of vision, as an example of a brief visual cue;

(2) an object entering and remaining in the field of vision, as an example of a longer visual cue; and

(3) a speech cue (talking head).

Each visual cue is accompanied by an audible cue generated by the object in the visual cue.

It will be seen from FIG. 3 that there is an underlying feature of temporal asymmetry in the perceptibility of synchronisation errors. Synchronisation errors in which the audio signal leads the visual signal are perceptually more important than those in which the visual signal leads the audio signal by the same interval. This is probably because we are used to receiving audio cues later than the corresponding visual cues in ordinary experience, since in the natural world the associated physical signals travel at vastly different speeds (340 metres/second for sound and 300 million metres/second for light).

The general form of the results reflects the recommended synchronisation thresholds given in Recommendation J.100 of the ITU, i.e. 20 milliseconds for audio lead and 40 milliseconds for audio lag. This recommendation provides a fixed figure for all content types and is intended to ensure that synchronisation errors remain imperceptible. This approach is suitable for the specification of broadcast systems.
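
A small illustration of the fixed, asymmetric J.100-style check that the text contrasts with the cue-dependent approach: 20 ms audio lead and 40 ms audio lag, applied uniformly to every content type.

```python
def within_j100(skew_ms: float) -> bool:
    """skew_ms > 0: audio leads the video; skew_ms < 0: audio lags."""
    return -40.0 <= skew_ms <= 20.0

assert within_j100(15.0)       # small audio lead: acceptable
assert within_j100(-35.0)      # larger audio lag: still acceptable
assert not within_j100(30.0)   # the same 30 ms fails as a lead...
assert within_j100(-30.0)      # ...but passes as a lag (asymmetry)
```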

However, it has been found that synchronisation error detection is greater for a long visual cue than for a short visual cue or a visual speech cue. FIG. 4 shows the results for these two stimulus types, and for a “talking head”, which is a special case because human subjects are highly specialised for speech perception compared with more general content. The two non-speech sound stimuli selected were both relatively abrupt, as these make greater demands on synchronisation than would a continuous noise.

These are shown on a single graph for ease of comparison.

The key features of these results are:

(i) The general trend in error detection asymmetry is apparent for all stimulus types.

(ii) The duration/distinctness of the long (“axe”) stimulus, in which the object generating the sound appears and then remains in view, results in a greater probability of error detection than for the shorter (“pen”) stimulus, in which the object appears with the sound but rapidly goes out of view again.

(iii) Error detection for the speech (“Marilyn”) stimulus is consistent with the other two stimuli when the audio lags the video, but is greater than for either of the other stimuli when the audio leads the video.

The probability of synchronisation error detection therefore varies with the duration and distinctness of the visual stimulus. Moreover, there is a high sensitivity to synchronisation errors in speech when the audio signal leads the video. This latter result was not expected, since it has been previously argued that during speech perception it is not possible to resolve the timing of “error” events more accurately than the duration of the semantic elements of the speech stream; see for example Chapter 7 in [Handel, S., “Listening: an introduction to the perception of auditory events”, MIT Press, 1989]. It appears in practice that, perhaps due to the short duration of certain semantic units such as consonant onsets, subjects are very sensitive to audio lead synchronisation errors with talking-head/speech stimuli.

CLAIMS

1. A method of determining the subjective quality of an audio-visual stimulus, comprising: measuring the actual synchronisation errors between the audio and visual elements of the stimulus; identifying characteristics of audio and visual cues in the stimulus that are indicative of the significance of synchronisation errors; generating a measure of subjective quality from said synchronisation errors and characteristics; analysing the audio and visual elements of the stimulus for the presence of said characteristic features indicative of the significance of synchronisation errors; and modifying the measure of subjective quality derived from the synchronisation errors and characteristics according to whether said characteristic features are present.
2. A method according to claim 1, wherein the characteristics of the audio and visual cues are used to generate one or more synchronisation error tolerance values.
3. A method as claimed in claim 2, wherein the audio-visual stimulus is monitored for occurrences of synchronisation errors exceeding said tolerance values.
4. A method according to claim 3, wherein the means generating the stimulus is controlled to maintain the synchronisation in a predetermined relationship with the said tolerance values.
5. A method according to claim 4, wherein the resulting measure of subjective quality is used to control the operation of an avatar animation process.
6. Apparatus for determining the subjective quality of an audio-visual stimulus, comprising: means for measuring the actual synchronisation errors between the audio and visual elements of the stimulus; means for identifying characteristics of audio and visual elements of the stimulus that are indicative of the significance of synchronisation errors; means for generating a measure of subjective quality from said synchronisation errors and characteristics; means for analysing the audio and visual elements of the stimulus for the presence of said characteristic features indicative of the significance of synchronisation errors; and means for modifying the measure of subjective quality derived from the synchronisation errors and characteristics according to whether said characteristic features are present.
7. Apparatus according to claim 6, wherein the means for identifying cue characteristics generates one or more synchronisation error tolerance values.
8. Apparatus as claimed in claim 7, comprising means for monitoring the audio-visual stimulus for occurrences of synchronisation errors exceeding said tolerance values.
9. Apparatus according to claim 8, comprising means for controlling the means generating the stimulus to maintain the synchronisation in a predetermined relationship with the said tolerance values.
10. Apparatus according to claim 9, further comprising animation process means controlled by the subjective quality measurement means to generate an animated image.
11. A method according to claim 1, wherein the audio-visual stimulus is a “talking-head.”
12. An apparatus according to claim 6, wherein the audio-visual stimulus is a “talking-head.”